Azure Doc Intelligence 0.2 - support paragraphs and tables for multiple models #10431
Great addition, @annjawn. I just made an update for DocumentIntelligence which also rolls up paragraphs that fall under the same SectionHeading, producing larger documents that stay semantically coherent with the structure of the document. Previous experiments also suggest that tables and paragraphs from the DI API overlap, so I used the spans to come up with an ordered list of non-overlapping text chunks. Would you mind having a look at https://github.com/LarsAC/langchain/tree/larsac/azure-di?
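The span-based idea mentioned above could look roughly like the following. This is only a sketch of the general technique, not the code in the linked branch: the `Chunk` class, its fields, and the rule of preferring a table over a paragraph that starts at the same offset are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a paragraph or table plus its character span."""
    kind: str    # "paragraph" or "table"
    offset: int  # start offset into the full document text
    length: int  # number of characters covered
    text: str


def non_overlapping(chunks: List[Chunk]) -> List[Chunk]:
    """Order chunks by offset and drop any chunk overlapping one already kept.

    Assumption: on a tie, a table wins over a paragraph, so the cell
    paragraphs inside that table are the ones dropped.
    """
    ordered = sorted(chunks, key=lambda c: (c.offset, c.kind != "table"))
    kept: List[Chunk] = []
    end = 0  # first offset not yet covered by a kept chunk
    for chunk in ordered:
        if chunk.offset >= end:
            kept.append(chunk)
            end = chunk.offset + chunk.length
    return kept


chunks = [
    Chunk("paragraph", 0, 5, "Intro"),
    Chunk("paragraph", 6, 1, "A"),        # cell text, also covered by the table
    Chunk("table", 6, 7, "A B C D"),
    Chunk("paragraph", 8, 1, "B"),        # cell text, also covered by the table
    Chunk("paragraph", 14, 4, "Tail"),
]
ordered_chunks = non_overlapping(chunks)
```

With this input the two cell paragraphs are dropped and the remaining chunks come out in reading order.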
Hey @LarsAC, yes, technically LINES/WORDS/PARAGRAPHS will indeed overlap with TABLE. The idea behind including TABLE is to give people a way to use it in Self-query. However, we may still want to include it in the text as well (it's probably a matter of just looking at I will definitely take a look at your updates 👍, though I am wondering whether we should still give users some flexibility in how they retrieve the text from the doc.
@annjawn Fully agree with the flexibility. I had also added a "switch" parameter to the constructor of the loader to let the user control how the text is parsed. We could likely add more options in parallel.
    - def __init__(self, client: Any, model: str):
    + def __init__(self, client: Any, model: str, split_mode: str):
Could we give this a default value, probably "page"? That way this isn't a breaking change and the default behavior doesn't change too much.
I do default it to “page” in DocumentIntelligenceLoader
But can we have a default here as well, in case this object is instantiated directly by a user?
Yes we can default "page" here as well @baskaryan
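A minimal sketch of what was agreed here, assuming the constructor signature quoted in the diff. `SketchParser` is a hypothetical stand-in, not the class in the PR; it only shows that a default on the inner class keeps direct instantiation backward compatible.

```python
from typing import Any


class SketchParser:
    """Hypothetical stand-in for the parser; only the constructor is shown."""

    def __init__(self, client: Any, model: str, split_mode: str = "page"):
        self.client = client
        self.model = model
        # Defaulting here means existing callers that pass only
        # (client, model) keep working and get page-level output.
        self.split_mode = split_mode


# Existing call sites do not need to change.
parser = SketchParser(client=None, model="prebuilt-document")
```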
    def _generate_docs(self, blob: Blob, result: Any) -> Iterator[Document]:
        for p in result.pages:
If split mode is page, should we just keep the existing logic? Is there value in parsing by paragraph and re-assembling pages?
@baskaryan the idea of offering paragraphs as an option is to do the chunking (splitting) with the Azure AI cognitive layout capabilities rather than having to chunk again with, say, a Text Splitter. This helps when generating embeddings, because the chunks (paragraphs) retain the semantic consistency of the text. We won't reassemble the paragraphs back into pages if `paragraph` is used; we keep them the way Doc Intelligence's layout extracts them. If the user specifies `page` explicitly, or doesn't pass the parameter at initialization, then `page` is the default and the entire page text is generated per page. Hope this makes sense.
What I mean is, why not do something like

    if self.split_mode == "page":
        for p in result.pages:
            ...
    elif self.split_mode == "paragraph":
        for p in result.paragraphs:
            ...

to save us having to write logic for reassembling paragraphs into pages in the case that split mode is page?
@baskaryan right, I am actually doing this here. The `result` object doesn't have each page's full text individually in the `pages` attribute, as it may seem; we construct pages by concatenating `paragraphs`. The highest-level grouping that Doc Intelligence provides is the entire document (all text from all pages concatenated into one), followed by per-page paragraphs (then lines, then words). The `content` field in `result` is the combined text of all pages, so it's easier to assemble each page from its paragraphs than to split `content` into individual pages. That assembly (of paragraphs) only happens if `self.split_mode == "page"`. Here's a structure for better explanation.
Attaching a sample JSON output from a 2-page document extracted via the `prebuilt-read` model.
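To make the page-assembly step concrete, here is a rough sketch of concatenating paragraphs per page. The `Paragraph` and `BoundingRegion` classes are simplified stand-ins, and the field names `content`, `bounding_regions`, and `page_number` are assumed to mirror the paragraph objects in the service result; the sample paragraphs are invented and are not taken from the attached JSON.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BoundingRegion:
    page_number: int


@dataclass
class Paragraph:
    content: str
    bounding_regions: List[BoundingRegion] = field(default_factory=list)


def pages_from_paragraphs(paragraphs: List[Paragraph]) -> Dict[int, str]:
    """Concatenate paragraph text per page, keyed by page number."""
    by_page: Dict[int, List[str]] = defaultdict(list)
    for para in paragraphs:
        if not para.bounding_regions:
            continue  # no page information; skipped in this sketch
        # Assumption: a paragraph belongs to the page of its first region.
        by_page[para.bounding_regions[0].page_number].append(para.content)
    return {page: "\n".join(parts) for page, parts in sorted(by_page.items())}


paragraphs = [
    Paragraph("Title", [BoundingRegion(1)]),
    Paragraph("First page body.", [BoundingRegion(1)]),
    Paragraph("Second page body.", [BoundingRegion(2)]),
]
pages = pages_from_paragraphs(paragraphs)
```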
    file_path: str,
    client: Any,
    model: str = "prebuilt-document",
    split_mode: str = "page",
@baskaryan here's where it's defaulted to `page`, so it won't introduce any breaking change.
            "type": "PAGE",
        },
    )
    yield d
@baskaryan here's the `page` vs `paragraph` logic. If `page` is used, we collate the paragraphs into each page's full text and set `"type": "PAGE"` in the `Document`. If `paragraph` is used, we keep the paragraphs as they are and simply `yield` them in the `Document` schema with `"type": "PARAGRAPH"`.
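The branching described in this comment can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's `_generate_docs`: `Document` is a minimal stand-in for the LangChain class, paragraphs are passed in as hypothetical `(page_number, text)` pairs, and rejecting unknown modes with `ValueError` is a choice made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class Document:
    """Minimal stand-in for a Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def generate_docs(
    paragraphs: List[Tuple[int, str]], split_mode: str = "page"
) -> Iterator[Document]:
    """Yield Documents from (page_number, text) pairs according to split_mode."""
    if split_mode == "paragraph":
        # Keep paragraphs exactly as extracted.
        for page, text in paragraphs:
            yield Document(text, {"page": page, "type": "PARAGRAPH"})
    elif split_mode == "page":
        # Collate paragraphs into one Document per page.
        collated: Dict[int, List[str]] = {}
        for page, text in paragraphs:
            collated.setdefault(page, []).append(text)
        for page, parts in collated.items():
            yield Document(" ".join(parts), {"page": page, "type": "PAGE"})
    else:
        raise ValueError(f"Unsupported split_mode: {split_mode!r}")


paras = [(1, "Hello."), (1, "World."), (2, "Bye.")]
page_docs = list(generate_docs(paras))
para_docs = list(generate_docs(paras, split_mode="paragraph"))
```

Keeping the mode check in one place means a caller that never passes `split_mode` still gets one `Document` per page.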
Hi @annjawn, may I ask what the plan is for this PR? Is the PR going to be updated and merged?
Apologies for the slow review! The PR has some merge conflicts; happy to re-review if you'd like to resolve them.
This PR introduces enhancements to the Azure Document Intelligence document loader.
- Supports `split_mode` during initialization of `DocumentIntelligenceLoader`. This defaults to `page`, in which case the full text of the page is returned. If `paragraph` is used in `split_mode`, then Documents are returned in chunks by paragraph. Paragraphs may be useful for generating embeddings in smaller chunks instead of having to split the full page text yet again.
- Supports tables when the model is `prebuilt-document`, `prebuilt-layout`, or `prebuilt-invoice`. This is useful for developers who intend to use tables with Self-query.
- Adds a `type` key in `Document` metadata to help distinguish page text vs paragraph vs tables, with `PAGE`, `PARAGRAPH`, `TABLE_HEADER` and `TABLE_ROW`.
- Further handling of the `Document` schema for self-query will still be needed, which can be done with the help of the `type` key (`TABLE_HEADER` and `TABLE_ROW`), `table_index`, and `page`.

Sample usage
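As a rough illustration of how the `type`, `table_index`, and `page` metadata keys listed above might be used downstream, the sketch below filters table rows out of a mixed list of documents. Plain dictionaries stand in for `Document` objects, and the contents and metadata values are invented for the example; none of this is output from the loader.

```python
from typing import Any, Dict, List

# Hypothetical documents: plain dicts standing in for Document objects.
docs: List[Dict[str, Any]] = [
    {"page_content": "Quarterly report", "metadata": {"type": "PAGE", "page": 1}},
    {"page_content": "Item, Price",
     "metadata": {"type": "TABLE_HEADER", "table_index": 0, "page": 1}},
    {"page_content": "Widget, 4.00",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
    {"page_content": "Gadget, 7.50",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
]


def table_rows(documents: List[Dict[str, Any]], table_index: int, page: int) -> List[str]:
    """Return the row contents of one table, selected by metadata only."""
    return [
        d["page_content"]
        for d in documents
        if d["metadata"].get("type") == "TABLE_ROW"
        and d["metadata"].get("table_index") == table_index
        and d["metadata"].get("page") == page
    ]


rows = table_rows(docs, table_index=0, page=1)
```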